Initially filtering on the following:
--remove-indels
--min-alleles 2 and --max-alleles 2
--thin 100
--minQ 30

vcftools \
--gzvcf PHHA.vcf.gz \
--remove-indels \
--min-alleles 2 \
--max-alleles 2 \
--thin 100 \
--minQ 30 \
--remove-filtered-all \
--recode \
--recode-INFO-all \
--out PHHA.bithin.q30

Produces file PHHA.bithin.q30.recode.vcf
Making a summary file to look at individual missingness
Produces file called PHHA.bithin.q30.imiss
256 of 256 individuals
669,819 SNPs
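The command for this step is not shown in these notes. A minimal sketch, assuming the recoded VCF from the previous step as input and vcftools' `--missing-indv` option, might look like the following. The wrapper name `make_imiss` is hypothetical, and the default file names are taken from the ones mentioned above.

```shell
# Sketch: per-individual missingness summary with vcftools.
# make_imiss is a hypothetical wrapper name; the default input file and
# output prefix are assumed from the file names mentioned in these notes.
make_imiss() {
    vcf="${1:-PHHA.bithin.q30.recode.vcf}"
    out="${2:-PHHA.bithin.q30}"
    # --missing-indv writes <out>.imiss with one row per individual
    vcftools --vcf "$vcf" --missing-indv --out "$out"
}
```

Calling `make_imiss` with no arguments would write PHHA.bithin.q30.imiss next to the input.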
Just looking for some natural breaks here. Wanting to remove
individuals that look like obvious outliers without drastically cutting
certain populations or lowering within-population sampling.
0.88 looks like a decent threshold. If we wanted to be
more aggressive, another option would be 0.85 (would
remove another 11 individuals). Probably can’t go much lower without
losing DA and CA populations entirely.
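To compare candidate thresholds, the individuals above a given cutoff can be pulled straight from the .imiss file. This is only a sketch: `high_missing` is a hypothetical helper, and it assumes the standard .imiss layout with a header line and the missing fraction (F_MISS) in the fifth column.

```shell
# Sketch: list individuals whose fraction of missing genotypes exceeds a cutoff.
# high_missing is a hypothetical helper; it assumes the standard .imiss layout
# (header line, then INDV N_DATA N_GENOTYPES_FILTERED N_MISS F_MISS).
high_missing() {
    imiss="$1"
    cutoff="${2:-0.88}"
    # Skip the header (NR > 1); F_MISS is column 5, the individual ID is column 1
    awk -v c="$cutoff" 'NR > 1 && $5 + 0 > c + 0 { print $1 }' "$imiss"
}
```

For example, `high_missing PHHA.bithin.q30.imiss 0.88 | wc -l` counts the individuals that would be dropped at 0.88, and rerunning with 0.85 shows how many more the stricter cutoff would remove.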
Below showing individual totals for each population based on 0.88 threshold.
Individual depth (averaged across all loci). Filtered out individuals with >88% missingness, as identified above.
245 of 256 individuals
669,819 SNPs
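The depth command itself is not recorded here. One way it could be run, assuming a plain-text list of the individuals to drop (one ID per line), is with vcftools' `--remove` and `--depth` options. The wrapper name `indiv_depth` and all file names passed to it are placeholders, not taken from the original notes.

```shell
# Sketch: mean depth per individual after dropping high-missingness samples.
# indiv_depth is a hypothetical wrapper; the removal list and output prefix
# are placeholder arguments, not names from the original notes.
indiv_depth() {
    vcf="$1"
    remove_list="$2"
    out="$3"
    # --remove drops the listed individuals; --depth writes <out>.idepth
    vcftools --vcf "$vcf" --remove "$remove_list" --depth --out "$out"
}
```

The resulting .idepth file has one row per retained individual with its mean depth across sites.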
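library(dplyr)
library(readr)
# Keep sites with adjusted depth below 12 and write their positions (no header row)
snp_set <- filter(full_depth, depth_adj < 12) %>%
  select(chr, pos)
write_delim(snp_set, '../filtering/sites_maxdp12.tsv', delim = '\t', col_names = FALSE)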